-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Linux semantics for fallocate() #2499
Conversation
This causes hole punching to work as it does in ext4. |
related to #326 |
ok, a question: does this mean that in practice rsync with --preallocate does use this now ? |
No. I should have been a bit more clear in the commit message. Being copy-on-write, there is no concept of pre-allocation in ZFS. This patch makes the |
OK, I see that the original patch wasn't intended to support |
I wrote a somewhat more comprehensive testing script: https://gist.github.com/dweeezil/e1b5d3953f4d4fefb888 which seems to the the Right Thing. |
xfstests has a few tests for this as well. I'll run it through and see how it does. |
This patch causes xfstests/127 to fail. 127 - output mismatch (see 127.out.bad)
--- 127.out 2014-07-17 15:45:42.695156001 -0700
+++ 127.out.bad 2014-07-17 16:05:51.259156001 -0700
@@ -6,8 +6,180 @@
=== FSX Standard Mode, No Memory Mapping ===
All operations completed A-OK!
=== FSX Standard Mode, Memory Mapping ===
-All operations completed A-OK!
+ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 fsx_std_mmap
+READ BAD DATA: offset = 0x26627, size = 0x1167, fname = /tank/xfstests/fsx_std_mmap
+OFFSET GOOD BAD RANGE
+0x26627 0x0000 0xf514 0x 0
+operation# (mod 256) for the bad data may be 20
+0x26628 0x0000 0x1477 0x 1
+operation# (mod 256) for the bad data may be 20
+0x26629 0x0000 0x7714 0x 2
+operation# (mod 256) for the bad data may be 20
+0x2662a 0x0000 0x143b 0x 3
+operation# (mod 256) for the bad data may be 20
+0x2662b 0x0000 0x3b14 0x 4
+operation# (mod 256) for the bad data may be 20
+0x2662c 0x0000 0x144b 0x 5
+operation# (mod 256) for the bad data may be 20
+0x2662d 0x0000 0x4b14 0x 6
+operation# (mod 256) for the bad data may be 20
+0x2662e 0x0000 0x14f6 0x 7
+operation# (mod 256) for the bad data may be 20
+0x2662f 0x0000 0xf614 0x 8
+operation# (mod 256) for the bad data may be 20
+0x26630 0x0000 0x140c 0x 9
+operation# (mod 256) for the bad data may be 20
+0x26631 0x0000 0x0c14 0x a
+operation# (mod 256) for the bad data may be 20
+0x26632 0x0000 0x1450 0x b
+operation# (mod 256) for the bad data may be 20
+0x26633 0x0000 0x5014 0x c
+operation# (mod 256) for the bad data may be 20
+0x26634 0x0000 0x1495 0x d
+operation# (mod 256) for the bad data may be 20
+0x26635 0x0000 0x9514 0x e
+operation# (mod 256) for the bad data may be 20
+0x26636 0x0000 0x1481 0x f
+operation# (mod 256) for the bad data may be 20
+LOG DUMP (49 total operations):
+1( 1 mod 256): MAPWRITE 0x3db39 thru 0x3ffff (0x24c7 bytes)
+2( 2 mod 256): MAPREAD 0x2e947 thru 0x33163 (0x481d bytes)
+3( 3 mod 256): READ 0x2e836 thru 0x3cba1 (0xe36c bytes)
+4( 4 mod 256): PUNCH 0x2e7 thru 0x5c42 (0x595c bytes)
+5( 5 mod 256): MAPWRITE 0xcaea thru 0x13ba9 (0x70c0 bytes)
+6( 6 mod 256): PUNCH 0x31645 thru 0x38d1c (0x76d8 bytes)
+7( 7 mod 256): SKIPPED (no operation)
+8( 8 mod 256): MAPREAD 0x2ed6b thru 0x3413b (0x53d1 bytes)
+9( 9 mod 256): READ 0x907c thru 0xf535 (0x64ba bytes)
+10( 10 mod 256): WRITE 0x37786 thru 0x3b8e0 (0x415b bytes)
+11( 11 mod 256): MAPREAD 0x3a290 thru 0x3ffff (0x5d70 bytes)
+12( 12 mod 256): READ 0x22f86 thru 0x2d9aa (0xaa25 bytes) ***RRRR***
+13( 13 mod 256): SKIPPED (no operation)
+14( 14 mod 256): WRITE 0x32ae3 thru 0x3a478 (0x7996 bytes)
+15( 15 mod 256): MAPWRITE 0x5acb thru 0x843a (0x2970 bytes)
+16( 16 mod 256): SKIPPED (no operation)
+17( 17 mod 256): SKIPPED (no operation)
+18( 18 mod 256): READ 0x279cd thru 0x28094 (0x6c8 bytes)
+19( 19 mod 256): TRUNCATE DOWN from 0x40000 to 0x61da ******WWWW
+20( 20 mod 256): MAPWRITE 0x2517c thru 0x2d393 (0x8218 bytes) ******WWWW
+21( 21 mod 256): READ 0x293fe thru 0x2d393 (0x3f96 bytes)
+22( 22 mod 256): MAPWRITE 0x27b51 thru 0x2e5a2 (0x6a52 bytes)
+23( 23 mod 256): TRUNCATE UP from 0x2e5a3 to 0x2e744
+24( 24 mod 256): SKIPPED (no operation)
+25( 25 mod 256): SKIPPED (no operation)
+26( 26 mod 256): SKIPPED (no operation)
+27( 27 mod 256): MAPREAD 0xcdd4 thru 0x13e17 (0x7044 bytes)
+28( 28 mod 256): MAPWRITE 0x1a705 thru 0x1f5e0 (0x4edc bytes)
+29( 29 mod 256): MAPREAD 0x1c452 thru 0x1d63a (0x11e9 bytes)
+30( 30 mod 256): MAPWRITE 0x373fc thru 0x3760d (0x212 bytes)
+31( 31 mod 256): SKIPPED (no operation)
+32( 32 mod 256): PUNCH 0x14c02 thru 0x18bd0 (0x3fcf bytes)
+33( 33 mod 256): SKIPPED (no operation)
+34( 34 mod 256): PUNCH 0x26117 thru 0x30089 (0x9f73 bytes) ******PPPP
+35( 35 mod 256): SKIPPED (no operation)
+36( 36 mod 256): SKIPPED (no operation)
+37( 37 mod 256): WRITE 0x3845 thru 0xc58f (0x8d4b bytes)
+38( 38 mod 256): WRITE 0x2d2e9 thru 0x38bf7 (0xb90f bytes) EXTEND
+39( 39 mod 256): SKIPPED (no operation)
+40( 40 mod 256): WRITE 0x7a3 thru 0x3d5f (0x35bd bytes)
+41( 41 mod 256): WRITE 0xd6eb thru 0x14efe (0x7814 bytes)
+42( 42 mod 256): WRITE 0x201f thru 0x3bd0 (0x1bb2 bytes)
+43( 43 mod 256): SKIPPED (no operation)
+44( 44 mod 256): SKIPPED (no operation)
+45( 45 mod 256): SKIPPED (no operation)
+46( 46 mod 256): MAPREAD 0x15931 thru 0x1ce44 (0x7514 bytes)
+47( 47 mod 256): MAPREAD 0x410f thru 0xee3d (0xad2f bytes)
+48( 48 mod 256): READ 0x2002 thru 0x8e32 (0x6e31 bytes)
+49( 49 mod 256): READ 0x26627 thru 0x2778d (0x1167 bytes) ***RRRR***
+Correct content saved for comparison
+(maybe hexdump "/tank/xfstests/fsx_std_mmap" vs "/tank/xfstests/fsx_std_mmap.fsxgood")
=== FSX Standard Mode, No Memory Mapping ===
All operations completed A-OK!
=== FSX Standard Mode, Memory Mapping ===
-All operations completed A-OK!
+ltp/fsx -f -q -l 262144 -o 65536 -S 191110531 -N 100000 fsx_std_mmap
+READ BAD DATA: offset = 0x26627, size = 0x1167, fname = /tank/xfstests/fsx_std_mmap
+OFFSET GOOD BAD RANGE
+0x26627 0x0000 0xf514 0x 0
+operation# (mod 256) for the bad data may be 20
+0x26628 0x0000 0x1477 0x 1
+operation# (mod 256) for the bad data may be 20
+0x26629 0x0000 0x7714 0x 2
+operation# (mod 256) for the bad data may be 20
+0x2662a 0x0000 0x143b 0x 3
+operation# (mod 256) for the bad data may be 20
+0x2662b 0x0000 0x3b14 0x 4
+operation# (mod 256) for the bad data may be 20
+0x2662c 0x0000 0x144b 0x 5
+operation# (mod 256) for the bad data may be 20
+0x2662d 0x0000 0x4b14 0x 6
+operation# (mod 256) for the bad data may be 20
+0x2662e 0x0000 0x14f6 0x 7
+operation# (mod 256) for the bad data may be 20
+0x2662f 0x0000 0xf614 0x 8
+operation# (mod 256) for the bad data may be 20
+0x26630 0x0000 0x140c 0x 9
+operation# (mod 256) for the bad data may be 20
+0x26631 0x0000 0x0c14 0x a
+operation# (mod 256) for the bad data may be 20
+0x26632 0x0000 0x1450 0x b
+operation# (mod 256) for the bad data may be 20
+0x26633 0x0000 0x5014 0x c
+operation# (mod 256) for the bad data may be 20
+0x26634 0x0000 0x1495 0x d
+operation# (mod 256) for the bad data may be 20
+0x26635 0x0000 0x9514 0x e
+operation# (mod 256) for the bad data may be 20
+0x26636 0x0000 0x1481 0x f
+operation# (mod 256) for the bad data may be 20
+LOG DUMP (49 total operations):
+1( 1 mod 256): MAPWRITE 0x3db39 thru 0x3ffff (0x24c7 bytes)
+2( 2 mod 256): MAPREAD 0x2e947 thru 0x33163 (0x481d bytes)
+3( 3 mod 256): READ 0x2e836 thru 0x3cba1 (0xe36c bytes)
+4( 4 mod 256): PUNCH 0x2e7 thru 0x5c42 (0x595c bytes)
+5( 5 mod 256): MAPWRITE 0xcaea thru 0x13ba9 (0x70c0 bytes)
+6( 6 mod 256): PUNCH 0x31645 thru 0x38d1c (0x76d8 bytes)
+7( 7 mod 256): SKIPPED (no operation)
+8( 8 mod 256): MAPREAD 0x2ed6b thru 0x3413b (0x53d1 bytes)
+9( 9 mod 256): READ 0x907c thru 0xf535 (0x64ba bytes)
+10( 10 mod 256): WRITE 0x37786 thru 0x3b8e0 (0x415b bytes)
+11( 11 mod 256): MAPREAD 0x3a290 thru 0x3ffff (0x5d70 bytes)
+12( 12 mod 256): READ 0x22f86 thru 0x2d9aa (0xaa25 bytes) ***RRRR***
+13( 13 mod 256): SKIPPED (no operation)
+14( 14 mod 256): WRITE 0x32ae3 thru 0x3a478 (0x7996 bytes)
+15( 15 mod 256): MAPWRITE 0x5acb thru 0x843a (0x2970 bytes)
+16( 16 mod 256): SKIPPED (no operation)
+17( 17 mod 256): SKIPPED (no operation)
+18( 18 mod 256): READ 0x279cd thru 0x28094 (0x6c8 bytes)
+19( 19 mod 256): TRUNCATE DOWN from 0x40000 to 0x61da ******WWWW
+20( 20 mod 256): MAPWRITE 0x2517c thru 0x2d393 (0x8218 bytes) ******WWWW
+21( 21 mod 256): READ 0x293fe thru 0x2d393 (0x3f96 bytes)
+22( 22 mod 256): MAPWRITE 0x27b51 thru 0x2e5a2 (0x6a52 bytes)
+23( 23 mod 256): TRUNCATE UP from 0x2e5a3 to 0x2e744
+24( 24 mod 256): SKIPPED (no operation)
+25( 25 mod 256): SKIPPED (no operation)
+26( 26 mod 256): SKIPPED (no operation)
+27( 27 mod 256): MAPREAD 0xcdd4 thru 0x13e17 (0x7044 bytes)
+28( 28 mod 256): MAPWRITE 0x1a705 thru 0x1f5e0 (0x4edc bytes)
+29( 29 mod 256): MAPREAD 0x1c452 thru 0x1d63a (0x11e9 bytes)
+30( 30 mod 256): MAPWRITE 0x373fc thru 0x3760d (0x212 bytes)
+31( 31 mod 256): SKIPPED (no operation)
+32( 32 mod 256): PUNCH 0x14c02 thru 0x18bd0 (0x3fcf bytes)
+33( 33 mod 256): SKIPPED (no operation)
+34( 34 mod 256): PUNCH 0x26117 thru 0x30089 (0x9f73 bytes) ******PPPP
+35( 35 mod 256): SKIPPED (no operation)
+36( 36 mod 256): SKIPPED (no operation)
+37( 37 mod 256): WRITE 0x3845 thru 0xc58f (0x8d4b bytes)
+38( 38 mod 256): WRITE 0x2d2e9 thru 0x38bf7 (0xb90f bytes) EXTEND
+39( 39 mod 256): SKIPPED (no operation)
+40( 40 mod 256): WRITE 0x7a3 thru 0x3d5f (0x35bd bytes)
+41( 41 mod 256): WRITE 0xd6eb thru 0x14efe (0x7814 bytes)
+42( 42 mod 256): WRITE 0x201f thru 0x3bd0 (0x1bb2 bytes)
+43( 43 mod 256): SKIPPED (no operation)
+44( 44 mod 256): SKIPPED (no operation)
+45( 45 mod 256): SKIPPED (no operation)
+46( 46 mod 256): MAPREAD 0x15931 thru 0x1ce44 (0x7514 bytes)
+47( 47 mod 256): MAPREAD 0x410f thru 0xee3d (0xad2f bytes)
+48( 48 mod 256): READ 0x2002 thru 0x8e32 (0x6e31 bytes)
+49( 49 mod 256): READ 0x26627 thru 0x2778d (0x1167 bytes) ***RRRR*** |
Duplicated the failed xfstests/127 here. For some reason, I only looked into the fallocate-related xfstests and realized I couldn't use them (for a reason I already forgot). I'm looking into this one. |
Adding a call to |
@dweeezil : I have noticed 4 issues with the fallocate(2) patch, even I had started to work on this and have a patch quite similar as yours :)
|
@rohan-puri All very good points: I was looking at other file native file systems as a reference but I can see that what you're describing is pretty much what ext4 does. For some reason, I didn't look at ext4 very closely. Here's my thoughts on these issues:
I'll try to get a refresh of this patch pushed this weekend. |
Re-pushed with inode locking added and also with use of truncate_pagecache_range() rather thantruncate_inode_pages_range(). Tested with xfstests generic/127 under 2.6.32.61 and 3.13.7. |
The after adding the inode lock, fsstress has uncovered a deadlock. I'm looking into it. |
Added a missing unlock in the error case and re-pushed. No more deadlock when running fsstress. |
@dweeezil : Great :)
|
According to the Linux documentation, the FALLOC_FL_KEEP_SIZE flag must be set in conjunction with the FALLOC_FL_PUNCH_HOLE flag. The previous logic caused a failure whenever FALLOC_FL_KEEP_SIZE was present. Also, mimic the behavior of other native file systems such as ext4 in cases where the file might be extended. If the offset is beyond the end of the file, return success without changing the file. If the extent of the punched hole would extend the file, only the existing tail of the file is punched. Add an autotools check for truncate_pagecache_range(). References: https://git.kernel.org/cgit/docs/man-pages/man-pages.git/tree/man2/fallocate.2#n102
Added a test to return EINVAL when the length is <= 0. I'll not have a chance to look at your points 3&4, @rohan-puri until later. |
@behlendorf I'll try to get this re-worked and also to base it on your recent zfs_trunc() fix. |
I'm going to close this one and re-open it with a more appropriate description once I re-work it (which should be pretty soon). |
According to the Linux documentation, the FALLOC_FL_KEEP_SIZE flag must be
set in conjunction with the FALLOC_FL_PUNCH_HOLE flag. The previous
logic caused a failure whenever FALLOC_FL_KEEP_SIZE was present.
Also, mimic the behavior of other native file systems such as ext4 in
cases where the file might be extended. If the offset is beyond the end
of the file, return success without changing the file. If the extent of
the punched hole would extend the file, only the existing tail of the
file is punched.
References:
https://git.kernel.org/cgit/docs/man-pages/man-pages.git/tree/man2/fallocate.2#n102